Sentence-based Document Size Reduction

Authors

  • Peter Schönhofen
  • Hassan Charaf
Abstract

In this article we present a novel document size reduction method that selects characteristic sentences by recognising fundamental semantic structures. With the help of document size reduction, document clustering processes less information while also avoiding misleading content. Sentence selection is carried out in two steps. First, a graph representing fundamental sentence relationships, measured by the number of common words, is constructed. Second, various statistical properties of this graph are computed and fed to a neural network trained by backpropagation, which then chooses a small fraction of sentences deemed to be relevant. Preliminary experiments on the Reuters-21578 news corpus showed that lead sentences (which summarise each news article) can be selected more reliably from the sentence relationship graph than from the traditional tf and tf×idf measurements. Experiments also showed that the presented method can substitute for tf and tf×idf in document clustering.
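The first step described in the abstract, building a graph whose edges are weighted by the number of words two sentences share, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`split_sentences`, `build_sentence_graph`, `node_features`), the naive regex-based sentence splitter and tokenizer, the `min_common` threshold, and the choice of degree and summed edge weight as the "statistical properties" are all assumptions. The abstract does not say which graph statistics are used or how the neural network is configured, so the second step is not shown.

```python
import re
from itertools import combinations


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased set of alphanumeric tokens; no stemming or stop-word removal (assumption).
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def build_sentence_graph(sentences, min_common=1):
    # Edge (i, j) exists when sentences i and j share at least `min_common` words;
    # the edge weight is the number of shared words.
    word_sets = [words(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        common = len(word_sets[i] & word_sets[j])
        if common >= min_common:
            edges[(i, j)] = common
    return edges


def node_features(n, edges):
    # Two simple per-sentence statistics (hypothetical choice):
    # degree (number of neighbours) and strength (sum of edge weights).
    degree = [0] * n
    strength = [0] * n
    for (i, j), weight in edges.items():
        degree[i] += 1
        degree[j] += 1
        strength[i] += weight
        strength[j] += weight
    return list(zip(degree, strength))


if __name__ == "__main__":
    text = "The cat sat on the mat. The cat ate fish. Dogs bark loudly."
    sentences = split_sentences(text)
    graph = build_sentence_graph(sentences)
    for sentence, features in zip(sentences, node_features(len(sentences), graph)):
        print(features, sentence)
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be passed to a trained classifier that marks a small fraction of sentences as relevant.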


Similar resources

Using Text's Terms and Syntactical Properties for Document Similarity

This paper reports on experiments performed to investigate the use of syntactical structures of sentences combined with sentences' terms for document similarity calculation. The document's sentences were first converted into ordered Part of Speech (POS) tags that were then fed into the Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing th...


Multi-candidate reduction: Sentence compression as a tool for document summarization tasks

This article examines the application of two single-document sentence compression techniques to the problem of multi-document summarization—a “parse-and-trim” approach and a statistical noisy-channel approach. We introduce the Multi-Candidate Reduction (MCR) framework for multi-document summarization, in which many compressed candidates are generated for each source sentence. These candidates a...


Sentence Reduction Algorithms to Improve Multi-document Summarization

Multi-document summarization aims to create a single summary based on the information conveyed by a collection of texts. After the candidate sentences have been identified and ordered, it is time to select which will be included in the summary. In this paper, we describe an approach that uses sentence reduction, both lexical and syntactic, to help improve the compression step in the summarizati...


Optimizing an Approximation of ROUGE - a Problem-Reduction Approach to Extractive Multi-Document Summarization

This paper presents a problem-reduction approach to extractive multi-document summarization: we propose a reduction to the problem of scoring individual sentences with their ROUGE scores based on supervised learning. For the summarization, we solve an optimization problem where the ROUGE score of the selected summary sentences is maximized. To this end, we derive an approximation of the ROUGE-N...


Feature selection based on word–sentence relation

Feature selection proved to improve both the speed and the quality of classification. Methods such as mutual information, information gain or chi-square are all based on the joint distribution of classes and words; there exist only a few methods which exploit contextual information for feature selection. We introduce an algorithm based on word and word pair frequencies that reduces both vocabul...




Publication date: 2004